In this tutorial, we start with simple R code and then introduce basic statistics and probability. Using R code and packages to learn statistical methods is the core of this webpage. (here’s the presentation)
R
Probability and Statistics
Data with R
Statistics with R
Model with R
Why is it important to learn statistics, and why do we use R for data analysis? In the current era of big data and artificial intelligence (AI), statistical models form the cornerstone of AI and machine learning. Statistics provides a comprehensive framework for explaining the outcomes, parameters, and nearly all other aspects of these models. By using R, a powerful statistical computing language, researchers can efficiently analyze and interpret complex datasets, thereby advancing the field of data science.
J.D. Long and P. Teetor (2019) present an R cookbook published with bookdown. Start by installing R and its IDE, RStudio.
The RStudio window displays four panes besides the nav bar. The upper-left
pane holds the open file; I prefer working with .Rmd (R Markdown) files. As in FIG
1., we write code inside a chunk and see the result below it. You can also
type code after the > prompt in the lower-left console pane and
press Enter to run it. The two right panes show the variables,
functions, and packages; we will introduce how they work later.
As for the question of using <- or =: both give the
same result when assigning variables, but my suggestion is
<-.
x <- 100
x = 100
x
## [1] 100
int <- 10
int
## [1] 10
chr <- "I am a character"
chr
## [1] "I am a character"
Be careful with variable names; for example, c is a function for creating vectors, so avoid using it as a name.
V1 <- c(1:10)
V2 <- c("2", 2)
V1
## [1] 1 2 3 4 5 6 7 8 9 10
V2
## [1] "2" "2"
\[ f(x)=x^2+2x+1\text{, and find } f(\pi). \]
f <- function(x){ x^2+2*x+1 }
f(pi)
## [1] 17.15279
Use ?function_name to see the details of a function.
F1 <- factor(c(1,0,1,1,0,1))
F1
## [1] 1 0 1 1 0 1
## Levels: 0 1
List1 <- list(integer1 = int,
character1 = chr,
vector1 = V1,
function1 = f,
factor1 = F1)
List1
## $integer1
## [1] 10
##
## $character1
## [1] "I am a character"
##
## $vector1
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $function1
## function(x){ x^2+2*x+1 }
##
## $factor1
## [1] 1 0 1 1 0 1
## Levels: 0 1
M1 <- matrix(c(1,2,1,
0,2,7),
ncol = 3,nrow = 2,byrow = T)
M1
## [,1] [,2] [,3]
## [1,] 1 2 1
## [2,] 0 2 7
M2 <- matrix(c(1,0,0,
0,1,0,
0,0,1), ncol = 3, nrow = 3, byrow = T)
M2
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
M1%*%M2
## [,1] [,2] [,3]
## [1,] 1 2 1
## [2,] 0 2 7
df <- data.frame(Name = c("Sam", "Yuji", "Lan", "Elaine"),
Age = c(28,24,NA,NA),
Height = c(177, 187, 159, 164),
Gender = factor(c("M", "M", "F", "F")))
df
When you repeat the same operation many times, use a for loop.
building <- c()
for(Floor in 1:101){
  building <- rbind(Floor, building)
}
building
## [,1]
## Floor 101
## Floor 100
## Floor 99
## Floor 98
## Floor 97
## (output truncated)
## Floor 3
## Floor 2
## Floor 1
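Growing an object inside a loop works, but base R's vectorized tools often do the same job in one call. As a minimal sketch (the `building` name and the `dimnames` are just illustrative), the same column can be built without a loop:

```r
# Same result without a loop: reverse the sequence and shape it as a one-column matrix
building <- matrix(rev(1:101), ncol = 1,
                   dimnames = list(rep("Floor", 101), NULL))
head(building)
```

This avoids repeatedly copying the object, which is why vectorized code is usually faster in R.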
\[ \Large (\Omega, \mathscr{F}, P) \]
\(\Omega\): sample space, the set of all possible outcomes.
\(\mathscr{F}\): a \(\sigma\)-algebra, the collection of events that can be assigned probabilities; in particular, \(\Omega\in\mathscr{F}\).
\(P\): probability measure, assigning each event a value between 0 and 1.
Probability space of tossing a fair coin, then:
\(\Omega = \{H,T\}\)
\(\mathscr{F} = \{\varnothing,\{H\},\{T\},\{H,T\}\}\)
\(P(\{H\})=P(\{T\})=\frac{1}{2}\)
Three Axioms
\(0\leq P(A)\leq 1,\;\forall A\)
\(P(\Omega) = 1\)
\(P(\bigcup_{i=1}^n A_i)=\sum_{i=1}^n P(A_i), \text{where }A_i\cap A_j=\varnothing,\;\forall i\neq j\)
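As a quick numerical sanity check (using a fair six-sided die, chosen here as an example), the three axioms can be verified in R:

```r
# A fair six-sided die: P({i}) = 1/6 for each face i
p <- rep(1/6, 6)
all(p >= 0 & p <= 1)              # Axiom 1: every probability lies in [0, 1]
isTRUE(all.equal(sum(p), 1))      # Axiom 2: P(Omega) = 1
p[1] + p[2]                       # Axiom 3: P({1, 2}) = P({1}) + P({2}) for disjoint events
```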
Use \(X, Y, Z, \ldots\) to denote random variables and \(x, y, z, \ldots\) to denote their values.
A random variable \(X\) assigns a value to each possible outcome.
Probability is denoted by \(P(X=x)=f(x)\).
Mean: \(E(X) = \sum_{i=1}^nx_if(x_i) = \mu\)
Variance: \(Var(X) = \sum_{i=1}^n(x_i-\mu)^2f(x_i)= E(X-E(X))^2=E(X^2)-(E(X))^2\)
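These two formulas can be computed directly in R. As an illustrative example, take a fair six-sided die with \(f(x)=1/6\) for \(x=1,\cdots,6\):

```r
# E(X) and Var(X) for a fair six-sided die
x  <- 1:6
fx <- rep(1/6, 6)
mu     <- sum(x * fx)           # E(X) = sum of x * f(x)
sigma2 <- sum((x - mu)^2 * fx)  # Var(X) = E(X - E(X))^2
c(mean = mu, variance = sigma2)
```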
\[ \Large X\sim B(n,p) \]
\(n\) and \(p\) are the parameters of the distribution
\(n\): the number of trials
\(p\): the probability of the event in each trial
Probability Mass Function (pmf):
\[ f_X(x) = \binom{n}{x}\;p^xq^{n-x},\;\forall x = 0, 1, \cdots, n,\;\text{where }q=1-p \]
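The pmf can be written by hand and checked against R's built-in dbinom(); the values n = 10 and p = 0.3 below are arbitrary example parameters:

```r
# Binomial pmf by hand vs. the built-in dbinom()
n <- 10; p <- 0.3; x <- 0:n
f_manual <- choose(n, x) * p^x * (1 - p)^(n - x)
isTRUE(all.equal(f_manual, dbinom(x, size = n, prob = p)))  # the two agree
isTRUE(all.equal(sum(f_manual), 1))                         # the pmf sums to 1
```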
For a continuous random variable, probability is given by \(P(X\leq x)=\int_{-\infty}^xf(t)\,dt\), hence \(P(X=x)=0\)
Mean: \(E(X) = \int_\mathbb{R}xf_X(x)dx = \mu\)
Variance: \(Var(X) = \int_\mathbb{R}(x-\mu)^2f_X(x)dx = E(X-E(X))^2=E(X^2)-(E(X))^2\)
\(X\sim\text{Exp}(\lambda)\)
Probability Density Function (pdf): \(f_X(x) = \lambda e^{-\lambda x},\;x\geq 0\)
f <- function(x, lambda = 1){lambda*exp(-lambda*x)}
curve(f,from = 0, to = 5, col = 1)
curve(f(x,2),from = 0, to = 5, col = 2, add = TRUE)
curve(f(x,.5),from = 0, to = 5, col = 3, add = TRUE)
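The mean and probabilities of the exponential distribution can be checked numerically with integrate(); the rate \(\lambda=2\) below is just an example value:

```r
# Check E(X) = 1/lambda and P(X <= 1) numerically for the exponential pdf
f <- function(x, lambda = 2) lambda * exp(-lambda * x)
ex <- integrate(function(x) x * f(x), lower = 0, upper = Inf)$value  # E(X), close to 1/2
px <- integrate(f, lower = 0, upper = 1)$value                       # P(X <= 1)
isTRUE(all.equal(px, pexp(1, rate = 2), tolerance = 1e-6))           # matches the built-in CDF
```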
\[ \Large X\sim N(\mu, \sigma^2) \]
\[ f_X(x) = \cfrac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},\;\forall\;x\in\mathbb{R} \]
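As with the exponential case, the normal density written by hand agrees with R's dnorm(); a small check with example values \(\mu=0\), \(\sigma=1\):

```r
# Normal pdf by hand vs. the built-in dnorm()
mu <- 0; sigma <- 1
x <- seq(-3, 3, by = 0.5)
f_manual <- 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
isTRUE(all.equal(f_manual, dnorm(x, mean = mu, sd = sigma)))  # the two agree
```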
\(X\sim t_n\)
\(t_n\overset{d}{\rightarrow}\color{red}{N(0,1)}\;\) as \(\;n\rightarrow\infty\)
curve(dt(x,df=1),from=-4,to=4,col="#000000",ylim=c(0,0.4))
curve(dt(x,df=2),from=-4,to=4,col="#240000",add=T)
curve(dt(x,df=3),from=-4,to=4,col="#460000",add=T)
curve(dt(x,df=4),from=-4,to=4,col="#690000",add=T)
curve(dt(x,df=5),from=-4,to=4,col="#800000",add=T)
curve(dt(x,df=8),from=-4,to=4,col="#990000",add=T)
curve(dt(x,df=10),from=-4,to=4,col="#bb0000",add=T)
curve(dt(x,df=50),from=-4,to=4,col="#cc0000",add=T)
curve(dt(x,df=91),from=-4,to=4,col="#dd0000",add=T)
curve(dt(x,df=100),from=-4,to=4,col="#ee0000",add=T)
curve(dnorm(x,0,1),from=-4,to=4,col="#ff0000",add=T)
\[ \mathcal{X}=(X_1, X_2,\cdots, X_{100}),\;\forall\;X_i\overset{iid}{\sim} N(160,10) \]
set.seed(19970608)
#random
x <- rnorm(n = 100, mean = 160, sd = sqrt(10))
sort(x, decreasing = FALSE)
## [1] 150.2045 150.3638 151.9515 152.4867 153.2311 153.9550 155.0583 155.1287
## [9] 155.4500 155.8637 156.0542 156.3821 156.4040 156.6807 156.8222 156.9162
## [17] 156.9413 157.3930 157.4179 157.4487 157.4901 157.5814 157.6270 157.8093
## [25] 158.1321 158.1779 158.1841 158.2011 158.3338 158.4286 158.5381 158.5747
## [33] 158.6128 158.6130 158.6211 158.7004 158.8500 158.9164 158.9294 158.9382
## [41] 158.9697 159.0128 159.0373 159.3363 159.4597 159.5536 159.6015 159.7461
## [49] 159.8022 159.8833 159.8919 159.9494 159.9613 159.9639 160.1277 160.2413
## [57] 160.2432 160.2516 160.2878 160.4307 160.5644 160.5740 160.7001 160.8122
## [65] 160.8769 160.8916 160.9366 160.9986 161.0055 161.0284 161.0291 161.2953
## [73] 161.3876 161.4904 161.5726 161.6253 161.6258 161.6970 161.7136 161.8119
## [81] 162.3868 162.4849 162.5548 162.6081 163.2007 163.2842 163.4371 163.7381
## [89] 163.8430 163.8762 164.2689 164.4121 164.4778 165.3108 165.5170 165.9530
## [97] 166.4605 167.1191 167.4459 168.5680
hist(x, nclass = 20)
library(kableExtra)
df <- read.csv(file = "csv/df.csv", fileEncoding = "Big5")
kable(df)
| Name | Height | Weight | Gender | Age |
|---|---|---|---|---|
| Sam Lu | 177 | 82 | M | 28 |
| Lan Huang | 159 | 45 | F | 21 |
| Yuji Nishida | 187 | 89 | M | 24 |
| Elaine Liao | 164 | 53 | F | 27 |
library(randomNames)
set.seed(19970608)
New_df.M <- data.frame(Name = randomNames(n = 68, gender = 0, which.names = "first"),
Height = rnorm(68, 175, 10),
Weight = rnorm(68, 70, 10),
Gender = rep("M", 68),
Age = sample(20:40, size = 68, replace = TRUE))
New_df.M
library(randomNames)
set.seed(19970607)
New_df.F <- data.frame(Name = randomNames(n = 28, gender = 1, which.names = "first"),
Height = rnorm(28, 160, 10),
Weight = rnorm(28, 50, 10),
Gender = rep("F", 28),
Age = sample(20:40, size = 28, replace = TRUE))
New_df.F
New_df <- rbind(df, New_df.M, New_df.F) #cbind()
New_df
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
New_df <- New_df%>%
mutate(BMI = Weight/(Height/100)^2)%>%
transform(Gender = as.factor(Gender),
BMI = round(BMI, digits = 1))
New_df
summary(New_df)
## Name Height Weight Gender Age
## Length:100 Min. :141.8 Min. :33.26 F:30 Min. :20.00
## Class :character 1st Qu.:164.2 1st Qu.:53.76 M:70 1st Qu.:25.00
## Mode :character Median :171.9 Median :62.46 Median :31.00
## Mean :171.1 Mean :62.89 Mean :30.21
## 3rd Qu.:178.5 3rd Qu.:72.78 3rd Qu.:35.00
## Max. :198.8 Max. :96.98 Max. :40.00
## BMI
## Min. :10.50
## 1st Qu.:18.38
## Median :21.40
## Mean :21.48
## 3rd Qu.:24.93
## Max. :33.70
New_df%>%filter(BMI<13)
New_df%>%filter(Height>190)
New_df%>%filter(Age==28)
\[ H_0: \mu=170 \]
\[ H_1: \mu\neq 170 \]
t.test(New_df$Height, mu=170)
##
## One Sample t-test
##
## data: New_df$Height
## t = 0.9556, df = 99, p-value = 0.3416
## alternative hypothesis: true mean is not equal to 170
## 95 percent confidence interval:
## 168.8563 173.2688
## sample estimates:
## mean of x
## 171.0625
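The reported t statistic follows the definition \(t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}\). A minimal sketch on small hypothetical data (the heights below are made up for illustration), checked against t.test():

```r
# One-sample t statistic and two-sided p-value from the definition
x  <- c(168, 172, 175, 169, 181, 174)   # hypothetical heights
tt <- (mean(x) - 170) / (sd(x) / sqrt(length(x)))
pv <- 2 * pt(-abs(tt), df = length(x) - 1)
isTRUE(all.equal(tt, unname(t.test(x, mu = 170)$statistic)))  # matches t.test()
isTRUE(all.equal(pv, t.test(x, mu = 170)$p.value))            # so does the p-value
```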
\[ H_0: \mu\leq 170 \]
\[ H_1: \mu>170 \]
t.test(New_df$Height, mu=170, alternative = "greater") # use alternative = "less" for H_1: mu < 170
##
## One Sample t-test
##
## data: New_df$Height
## t = 0.9556, df = 99, p-value = 0.1708
## alternative hypothesis: true mean is greater than 170
## 95 percent confidence interval:
## 169.2163 Inf
## sample estimates:
## mean of x
## 171.0625
\[ H_0: \mu_1\geq\mu_2 \]
\[ H_1: \mu_1<\mu_2 \]
BMI_M <- New_df%>%filter(Gender=="M")
BMI_M <- sample(BMI_M$BMI, 30, replace = F)
BMI_F <- New_df%>%filter(Gender=="F")
BMI_F <- BMI_F$BMI
#test
t.test(BMI_M, BMI_F, alternative = "less")
##
## Welch Two Sample t-test
##
## data: BMI_M and BMI_F
## t = 3.7572, df = 57.764, p-value = 0.9998
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 5.818209
## sample estimates:
## mean of x mean of y
## 22.17333 18.14667
\[ Y = \beta_0+\beta_1X+\varepsilon\;,\;\text{where }\;\varepsilon\sim N(0,\sigma^2) \]
glm1 <- glm(Height~Weight, data = New_df)
summary(glm1)
##
## Call:
## glm(formula = Height ~ Weight, data = New_df)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.72221 4.57094 32.974 < 2e-16 ***
## Weight 0.32341 0.07086 4.564 1.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 103.0028)
##
## Null deviance: 12240 on 99 degrees of freedom
## Residual deviance: 10094 on 98 degrees of freedom
## AIC: 751.24
##
## Number of Fisher Scoring iterations: 2
library(plotly)
est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
Weight = New_df$Weight)
plot_ly(New_df, x = ~Weight,
y = ~Height,
text = New_df$Name,
type = 'scatter',
alpha = 0.65,
mode = 'markers',
name = 'Weight-Height')%>%
add_trace(data = New_est,
x = ~Weight,
y = ~Height,
name = 'Regression Fit',
mode = 'lines')
library(plotly)
est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
Weight = New_df$Weight)
plot_ly(New_df,
x = ~Weight,
y = ~Height,
text = New_df$Name,
color = ~New_df$Gender,
type = 'scatter',
alpha = 0.65, mode = 'markers',
name = 'Weight-Height')%>%
add_trace(data = New_est,
x = ~Weight,
y = ~Height,
name = 'Regression Fit',
mode = 'lines')
library(plotly)
est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
Weight = New_df$Weight)
plot_ly(New_df, x = ~Weight,
y = ~Height,
text = New_df$Name,
color = ~New_df$Age,
type = 'scatter',
alpha = 0.65,
mode = 'markers',
name = 'Weight-Height')%>%
add_trace(data = New_est,
x = ~Weight,
y = ~Height,
name = 'Regression Fit',
mode = 'lines')
Model: \(Y=X\beta+\varepsilon\), where \(\varepsilon\sim N(0,\sigma^2 I)\)
Estimate \(\beta_0\), \(\beta_1\)
\[ Y = \begin{pmatrix} Y_1\\ \vdots\\ Y_n \end{pmatrix} ,\; X = \begin{pmatrix} 1 & X_1\\ \vdots & \vdots\\ 1 & X_n \end{pmatrix} ,\;\text{and }\; \beta = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \]
Use LSE (least squares estimation) to estimate \(\beta\)
Estimation of \(Y\): \(\hat{Y}=X\hat{\beta}\)
Minimize:
\[ (Y-X\hat{\beta})^T(Y-X\hat{\beta}) \]
\[ \cfrac{\partial}{\partial\hat{\beta}} (Y-X\hat{\beta})^T(Y-X\hat{\beta})=0 \]
\[ \hat{\beta} = (X^TX)^{-1}X^TY \]
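The closed form \(\hat{\beta}=(X^TX)^{-1}X^TY\) can be computed directly and compared with lm(); the data below are simulated purely for illustration:

```r
# Least squares estimate via the closed form, checked against lm()
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)             # simulated data with known beta
X <- cbind(1, x)                         # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
isTRUE(all.equal(as.vector(beta_hat), unname(coef(lm(y ~ x)))))  # same estimates
```

In practice lm() and glm() use a QR decomposition rather than inverting \(X^TX\), which is numerically more stable, but the estimates agree.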